Beat the Optimizer
Volume Number: 7
Issue Number: 4
Column Tag: Fortran's World
By Jonathan Bell, Clinton, SC
Note: Source code files accompanying this article are located on the MacTech CD-ROM or source code disks.
[Jon earned a Ph.D. in elementary particle physics from the University of
Michigan by doing a lot of number-crunching in FORTRAN on a VAX. Now he teaches
physics and computer science at Presbyterian College in Clinton, SC, where he first
saw a Macintosh and fell in love with it at first sight. Although he now does much of his
teaching and programming in Pascal, and dabbles in assembler, he’s still fond of
FORTRAN, especially the dialects with post-FORTRAN-77 extensions.]
In heavy-duty number-crunching applications, the efficiency of the object code
is a major consideration. For this reason, minicomputer and mainframe compilers can
usually optimize the code in various ways. Most Macintosh compilers until recently
have not done a very good job with optimization, but this is starting to change. One
example of this is the Language Systems FORTRAN compiler, version 2.0, which runs
under the Macintosh Programmer’s Workshop (MPW). As Jörg Langowski has pointed
out (MacTutor v6#3), this new compiler does a creditable job of optimizing
array-index calculations in code involving two-dimensional arrays.
This does not mean that assembly language programming has been rendered
obsolete. In this article I would like to demonstrate (a) that it is still relatively easy
to “beat the optimizer” with hand-coded assembly language, and (b) that assembly
language code does not have to be especially “tricky” or obscure to be effective.
This article will also demonstrate how to use both implementations of Apple’s
floating-point arithmetic package: “software SANE”, which is available on all
Macintoshes via a package in the System file, and “hardware SANE”, which is available
on 68020 and 68030 machines (Mac II or higher) via the 68881 or 68882
floating-point coprocessor chip. (Whenever I mention the 68030, I will also
implicitly include the 68020; likewise the 68882 will also “include” the 68881.)
I will not attempt to teach assembly language programming in general, but will
assume that you have at least some reading knowledge of 680x0 assembly language. For
those who are interested in learning about assembly language, some references are
listed at the end of the article.
Matrix Multiplication
As our “test case” I will use matrix multiplication. To refresh your memory
(or perhaps initialize it), recall that a matrix is simply a two-dimensional table of
numbers, typically represented in a computer program by a two-dimensional array.
By convention, we let the first and second indices be the row and column numbers
respectively. Thus in Pascal or C, A[5,2] denotes the number in the fifth row of the
second column of matrix A. In FORTRAN we use parentheses rather than square
brackets: A(5,2).
If the number of columns in matrix A is the same as the number of rows in
matrix B, or equivalently, if a row of A has the same length as a column of B, then we
can multiply A and B to form a new matrix, C. To calculate C(5,2), for example, we
take the fifth row of A and the second column of B, “match up” the corresponding
elements, multiply them pairwise and finally add up the products:
C(5,2) = A(5,1) * B(1,2)
+ A(5,2) * B(2,2)
+ A(5,3) * B(3,2)
+ ...
+ A(5,M) * B(M,2).
Multiplying two matrices requires three nested loops. The two outer loops cycle
over the rows and columns of C, while the innermost loop cycles over the terms in the
sum for each element of C. In Language Systems FORTRAN we might write the following
subroutine to perform matrix multiplication:
SUBROUTINE MXMPY (A, B, C, L, M, N)
C------------------------------------------------
C Performs the matrix multiplication A*B = C.
C The dimensions of the matrices must be related
C as shown below.
C------------------------------------------------
IMPLICIT NONE
INTEGER I, J, K, L, M, N
EXTENDED A(L,M), B(M,N), C(L,N), SUM
DO I = 1, L
DO J = 1, N
SUM = 0.0X0
DO K = 1, M
SUM = SUM + A(I,K) * B(K,J)
END DO
C(I,J) = SUM
END DO
END DO
END
If you are familiar with FORTRAN, you will note that this code is not standard
FORTRAN 77. It uses several “extensions” of the language:
1. Each DO-loop is terminated with an END DO statement rather than a labeled
CONTINUE statement. The corresponding DO statements do not use a statement
number (label) to specify the end of the loop.
2. The IMPLICIT NONE statement turns off FORTRAN’s default type-specification
rules and forces the programmer to declare all variables with their data types, as
in Pascal.
3. The EXTENDED data type is an 80-bit or 96-bit floating point number as used by
the SANE packages on the Mac; which one is used depends on whether the routine
is compiled with the -mc68882 compiler option. Alternatively, for portability
to other systems, floating-point variables can be declared as REAL (32 bits) or
DOUBLE PRECISION (64 bits), in which case the compiler’s -extended option can
be used to force them to be compiled as if they were EXTENDED.
4. The constant 0.0 should be written in exponential notation as 0.0X0 to tell the
compiler that it’s supposed to be stored in EXTENDED format. Otherwise it will
be stored as a 32-bit REAL and converted to EXTENDED whenever used in
calculations, which wastes time. Again, the compiler’s -extended option will
force REAL constants to be stored as EXTENDED.
Non-FORTRAN programmers should pay special attention to the “variable
dimensions” of the arrays A, B and C. In a FORTRAN subroutine, an array argument
(or parameter, to you Pascal and C people) may be specified with variable dimensions
which are also specified as arguments. Both the array and the dimensions must be
arguments, not local variables. This allows a FORTRAN programmer to write a
subroutine which can perform the same operation on arrays and matrices of different
sizes.
Speed Tests
I wrote a simple test program [Listing #1] which sets up a matrix, multiplies
that matrix by itself ten times, records the time using the Toolbox TickCount function,
and divides by ten to get the time for a single matrix multiplication. The program
repeats this process for 5x5, 10x10, 20x20 and 50x50 matrices. (I will present
results only for 50x50.)
I also wanted to find out what proportion of this time was spent in actual
floating-point calculations as opposed to array-indexing overhead. To do this, I
compiled the subroutine to assembly-language source code (using the compiler’s -a
option), commented out the floating-point operations, renamed it DUMMY, and
assembled it using the MPW Assembler. The test program runs DUMMY ten times,
using the same matrix sizes as with the “real” MXMPY.
The Language Systems FORTRAN compiler has four levels of optimization,
numbered from 0 (no optimization) to 3 (maximum optimization). Repeating the
timing tests for each level revealed that for this example, levels 0 and 1 give the same
results, as do levels 2 and 3. (Of course, I constructed a separate DUMMY for each
optimization level!)
The timings I obtained for 50 x 50 matrix multiplication using the standard
SANE library (“software SANE”) are shown in lines 1 and 2 of the table at the end of
the article, for each of the two levels of optimization, on a Mac SE. The optimization
does in fact reduce the overhead significantly. However, the overhead is only a small
fraction of the total execution time, because software SANE is very slow, so we gain a
speed increase of only about 12%.
The only way to speed up software SANE significantly is to switch to a faster
machine. Lines 3 and 4 of the table show results for a Mac SE/30, demonstrating a
speed increase of about 5x for both levels of optimization.
We can improve the SE/30 figures by taking advantage of the 68882
floating-point coprocessor (FPU). The Language Systems FORTRAN compiler allows
you to specify that you want code which uses the FPU directly (“hardware SANE”).
Recompiling the subroutines and test program using the -mc68030 and -mc68882
options (and creating a new dummy subroutine) gave the results shown in lines 5 and 6
of the table.
Overall, the hardware-SANE versions are 6.6x (opt=1) and 7.9x (opt=2) faster
than the software-SANE versions. Interestingly, the indexing overhead runs faster in
the hardware-SANE versions, probably because fewer machine-language instructions
are necessary to set up the actual floating-point calculations.
How Matrices Work
Before proceeding further, we need to look more closely at how matrices are
implemented in high-level languages. When we store data in a matrix, just where does
it go, and when we retrieve data from a matrix, where does the program look for it?
Normally we picture a matrix as a table of numbers, laid out in a neat
two-dimensional pattern of rows and columns. However, computer memory is not
two-dimensional. It is a one-dimensional linear sequence of addresses. A
programming language that supports matrices must somehow “map” two dimensions
into one. There are two obvious ways to do this.
One way is to take the rows of the matrix one after the other and lay them out
“end to end”, forming a single long horizontal line. If we start at one end of the line and
examine one element after another in sequence until we come to the other end, we find
that the second index (the column number) passes through its complete range of values
repeatedly, with the first index (the row number) changing by one after each cycle.
This is called row-major ordering. It is the method used by Pascal, C, and most other
languages for allocating matrices.
The other way, as you might expect, is to take the columns of the matrix one
after the other and lay them out “end to end”, forming a single long vertical line. Now,
as we step from one element to the next, the first index varies more rapidly than the
second. This is called column-major ordering, and is used by FORTRAN, alone among
the major programming languages.
Using column-major ordering, we can easily verify that whenever a FORTRAN
program needs to access an element of a matrix, it must compute the address as follows:
addr B(K,J) = addr B(1,1)    [ the start of the matrix ]
+ (J - 1) * n                [ the J-1 complete columns before column J ]
+ (K - 1) * w                [ the K-1 elements before B(K,J) within column J ]
In this formula, n is the number of bytes in one complete column of the matrix,
and w is the number of bytes in one matrix element. I’ve kept things simple by
assuming that both the row and column indices have 1 as the lower bound, as is the most
common practice; if you like, you can figure out the formula for the more general case
as an exercise.
The indices J and K usually are variables, so their values have to be fetched from
memory as the program runs. If we declare the matrix in the usual way, by specifying
the size of the matrix explicitly, then the compiler can calculate the parameters n and
w and insert them into the code as constants, possibly as immediate operands in
arithmetic operations.